Implement shared log pool for ZFS #14520
base: master
Conversation
Thanks for the detailed overview in the description. One thing that I didn't quite catch: since the txg cadence of each pool (shared log pool, and each of the client pools) is different, how are persistent state changes coordinated? e.g.
So each client txg, the client tells the shared pool about the new "head location" of each objset, which is stored in the shared pool's chain map. If the system crashes, I guess that the client pool assumes that the chain map has the new value. How do we ensure that the new value is persisted in the shared pool before the client pool's txg completes? Seems like doing txg_wait_synced() might be expensive, since we'd have to do it for every client pool txg, and there could be many client pools.
We don't actually have to ensure that, usually. The updates to the chain map are always one of the following: 1) adding a new dataset to the map, 2) removing a dataset from the map, 3) moving the head of the ZIL chain further back along the chain. Case 1 only requires that the chain map be synced out before the dataset becomes writable for the first time. Case 2 doesn't require any synchronization, because we have a mechanism to clear out stale entries in the chain map: if the delete happens to get synced to the client but not the provider, the new GC ioctl will clear it out when it runs, because there won't be a matching guid in the client pool anymore. We need the GC anyway to handle the case where pools are destroyed while not connected, so having it be the solution for this fairly narrow race seemed acceptable to me. Case 3 also doesn't require synchronization. Because we only ever move the head of the chain further down along the chain, you can't lose the chain if the crash happens before the shared log pool syncs out. When it comes back online, it'll scan the whole chain and claim the entire length. When the client comes back online, the replay will be mostly redundant blocks (which the ZIL already knows how to handle), and the next head update will cause the appropriate frees and update to the zap. It behaves very similarly to the way the ZIL does today if a crash happens, just with the start of the ZIL chain and the contents stored in a separate pool.
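To illustrate case 3, here is a toy standalone C model (not ZFS code; integer block numbers stand in for blkptrs) of why a stale recorded head is harmless: recovery claims the chain from the older head and merely replays a few records the client already synced.

	#include <stdio.h>

	int
	main(void)
	{
		/*
		 * Blocks 0..9 form a ZIL chain. The client has synced
		 * everything up to block 7, so its latest head update
		 * says "head = 7".
		 */
		int true_head = 7;

		/* Crash: the shared log pool only persisted an older update. */
		int recorded_head = 4;

		/*
		 * Recovery walks the chain from the recorded head. Blocks
		 * 4..6 hold records the client already synced; ZIL replay
		 * already tolerates such redundant records, and the next
		 * head update frees them.
		 */
		for (int b = recorded_head; b < 10; b++)
			printf("claim block %d%s\n", b,
			    b < true_head ? " (redundant replay)" : "");
		return (0);
	}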
That's neat. In Case 2 (
@@ -3682,6 +3786,16 @@ zpool_do_import(int argc, char **argv)
 	argc -= optind;
 	argv += optind;
 
+	if (shared_log_pool != NULL && !(flags & ZFS_IMPORT_MISSING_LOG)) {
+		(void) fprintf(stderr, gettext("-L requires -m\n"));
Maybe this answers my earlier question: -L is only used if you're ignoring missing logs. Does it change an existing non-client pool into a client of the specified shared log pool?
That's correct. The man page covers it well enough, I think, but I could add a comment in the code as well.
Causes the pool to switch to using the specified shared log pool when
imported.
Requires the
.Fl m
Why do we need to discard missing log devices? What if there was no log device, only embedded logs? Do those also get discarded? Why do we need to do that?
When we import with the shared log pool, the existing ZIL chains will be destroyed and new ones will be put in their place. The reason for that is that we can't really have both in place at once; only shared log xor normal slog, not both. Requiring the -m flag to force the user to realize that they will be losing their existing SLOG devices as part of the transition seemed like a good safety mechanism to me.
I was looking into this possibility and it looks like I was mistaken: the GC doesn't actually clear the stale datasets if the pool is imported. It only cleans up entries for pools that are not present. The obstacle, in general, is that the chain map stores a list of ZILs by guid. There isn't a list, in the client pool, of all its datasets by guid. So in order to do the cleanup, either automatically or as part of the GC, we would need to iterate over every objset in the pool to find the ones with guids that are in the chain map but not in the pool, and clean up the chain map entries. That could be quite slow for pools with lots of datasets. This is probably worth doing so we don't leak space in the shared log pool (though the leak would be very slow, since usually ZILs don't have that many blocks in the chain when they are deleted, and only deletions that race with a crash or power outage would leak). We probably want to make it an asynchronous task, though, so that we don't further slow down pool import.
I think you could do this efficiently via mark-and-sweep or a similar algorithm. e.g.: When opening a client pool, iterate over all its objsets (which we already do), and add all of their guids to a data structure sorted by guid (e.g. an AVL tree, in O(N * log(N)) time). Then create a similar structure of all the guids of objsets in the chain map (in O(M * log(M))). Now you can iterate them both in order (in O(N + M)) and find entries that are in the chain map but not the client pool, and delete those.
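A minimal standalone sketch of that merge pass, with sorted plain arrays standing in for the AVL trees and made-up guid values:

	#include <stdio.h>
	#include <stdint.h>
	#include <stdlib.h>

	/* Comparator for qsort() over uint64 guids. */
	static int
	guid_cmp(const void *a, const void *b)
	{
		uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
		return (x < y ? -1 : (x > y ? 1 : 0));
	}

	int
	main(void)
	{
		/* Guids seen while opening the client pool (the "mark"). */
		uint64_t client[] = { 11, 42, 7, 99 };
		/* Guids recorded for this pool in the chain map. */
		uint64_t chain_map[] = { 42, 7, 13, 99, 11, 57 };
		size_t nc = 4, nm = 6, i = 0, j = 0;

		qsort(client, nc, sizeof (uint64_t), guid_cmp);    /* O(N log N) */
		qsort(chain_map, nm, sizeof (uint64_t), guid_cmp); /* O(M log M) */

		/* One merge pass, O(N + M): entries only in the chain map are stale. */
		while (j < nm) {
			if (i < nc && client[i] == chain_map[j]) {
				i++, j++;
			} else if (i < nc && client[i] < chain_map[j]) {
				i++;
			} else {
				printf("stale chain map entry: %llu\n",
				    (unsigned long long)chain_map[j]);
				j++;
			}
		}
		return (0);
	}

Run against these arrays, the sweep reports guids 13 and 57 as stale.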
During dataset creation: if the system crashes after the shared pool syncs and before the client pool syncs, we would get a leaked entry in the chain map. I think that this would show up the same way as a crash during/after a destroy.
An update: This actually turned out to not be necessary. I was misremembering the design and misreading the code; the id we use to index the objsets in the chain map is their objset ID, not a guid. As a result, we don't need to do mark and sweep. We can simply iterate over the objsets in the chain map for a given client when we import it, and add any that are no longer present to the spa_zil_deletes the way we would for an ordinary deletion. They'll get processed when a TXG syncs out.
This is (now) implemented in
Agreed; a destroy where the client finishes and the shared log doesn't is symmetrical to a creation where the shared log finishes and the client doesn't, so the same process should handle both.
@@ -355,7 +358,8 @@ get_usage(zpool_help_t idx)
 	case HELP_CLEAR:
 		return (gettext("\tclear [-nF] <pool> [device]\n"));
 	case HELP_CREATE:
-		return (gettext("\tcreate [-fnd] [-o property=value] ... \n"
+		return (gettext("\tcreate [-fndL] [-l pool] ... \n"
Rather than using -l pool for shared logs, did you consider adding a new "reserved name" like sharedlog to use in the vdev list? Possibly we could even overload the log reserved name and use a shared log if we find a pool name rather than a device name (although there is, I suppose, the possibility that the shared log pool name is also a valid device name...)
I considered it briefly. The downside is that the implementation is annoying; the vdev-tree-construction code doesn't really have a way to pass back a pool-level property like the fact that we want to be using a shared log pool, so getting the information out in the right format is more annoying if we make it part of the vdev code.
Hello, everyone. Will this PR be released in version 2.3?
Once this rebase onto
	dp->dp_chain_map_obj = zap_create_flags(dp->dp_meta_objset, 0,
	    ZAP_FLAG_HASH64 | ZAP_FLAG_UINT64_KEY |
	    ZAP_FLAG_PRE_HASHED_KEY, DMU_OTN_ZAP_METADATA,
	    chain_map_zap_default_bs, chain_map_zap_default_ibs,
	    DMU_OT_NONE, 0, tx);
I am not sure ZAP_FLAG_PRE_HASHED_KEY is right here, since it uses only the first 64 bits of the key, which covers the pool GUID but not the objsets within it, so there might be plenty of collisions. And in theory, if I remember the ZAP code right, it may reach a point of ZAP leaf overflow, since unlike the specification we do not support leaf chaining, and with identical hashes we might not be able to split the leaf.
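To make the collision concern concrete, here is a standalone illustration. The two-word key layout (pool guid, then objset id) comes from the PR description; the premise that the pre-hashed value is only the first word is this comment's claim, and the mixing function at the end is a hypothetical mitigation, not code from the PR.

	#include <stdio.h>
	#include <stdint.h>

	int
	main(void)
	{
		/* Two objsets of the same client pool share key[0] (the guid). */
		uint64_t key_a[2] = { 0x1234abcd5678ef00ULL, 1 };
		uint64_t key_b[2] = { 0x1234abcd5678ef00ULL, 2 };

		/*
		 * If the "pre-hashed" value is taken from key[0] only,
		 * these two entries always land on the same hash.
		 */
		printf("%s\n", key_a[0] == key_b[0] ? "collision" : "distinct");

		/*
		 * Hypothetical mitigation: mix both words before storing,
		 * or drop the flag so ZAP hashes the full 128-bit key.
		 */
		uint64_t mixed = key_a[0] ^ (key_a[1] * 0x9e3779b97f4a7c15ULL);
		printf("mixed=%016llx\n", (unsigned long long)mixed);
		return (0);
	}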
	if (mc->mc_ops->msop_type == METASLAB_TYPE_VIRTUAL) {
		ASSERT3P(mc->mc_virtual, !=, NULL);
		spa_t *target_spa = mc->mc_virtual;
		dmu_tx_t *tx = dmu_tx_create_mos(target_spa->spa_dsl_pool);
		VERIFY0(dmu_tx_assign(tx, TXG_WAIT | TXG_NOTHROTTLE));
		uint64_t target_txg = dmu_tx_get_txg(tx);
		int ret = metaslab_alloc(target_spa,
		    spa_normal_class(target_spa), psize, bp, ndvas, target_txg,
		    hintbp, flags, zal, zio, allocator);
		dmu_tx_commit(tx);
You are allocating space in a transaction of a different pool, without storing the pointer anywhere within that transaction. What protects us from a space leak here if the transaction gets committed immediately and we crash right here, before we do anything else?
@@ -618,7 +618,7 @@ range_tree_verify_not_present(range_tree_t *rt, uint64_t off, uint64_t size)
 {
 	range_seg_t *rs = range_tree_find(rt, off, size);
 	if (rs != NULL)
-		panic("segment already in tree; rs=%p", (void *)rs);
+		panic("segment already in tree; rt=%px rs=%px", rt, (void *)rs);
Isn't the %px a Linux-specific extension?
PS: Since this patch is already big, it would be nice to move all unrelated chunks (here and above) out.
	static int
	get_shared_log_pool(nvlist_t *config, spa_t **out)
	{
		if (config == 0)
Cosmetics, but do we really need to check the argument here, and if so, why 0 rather than NULL?
	if (nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_GUID, &guid))
		return (0);
Can a pool really have no GUID?
	if (shared_log != NULL || fnvlist_lookup_boolean(config,
	    ZPOOL_CONFIG_IS_SHARED_LOG)) {
		spa->spa_chain_map_taskq = taskq_create("z_chain_map", 100,
		    defclsyspri, 1, INT_MAX, TASKQ_DYNAMIC |
		    TASKQ_THREADS_CPU_PCT);
I haven't read to the point of this taskq's usage yet, but I wonder whether we really need 100% of CPU threads per pool for this. And do I read it right that it is created for both parent and child pools?
@@ -2769,14 +2919,16 @@ spa_load_verify_cb(spa_t *spa, zilog_t *zilog, const blkptr_t *bp,
 	 */
 	if (!spa_load_verify_metadata)
 		return (0);
+	if (zilog && zil_shared_log(zilog))
+		io_spa = spa_get_shared_log_pool(spa);
Cosmetics, but wouldn't log_spa or zil_spa be more specific?
	{
		(void) arg;
		int error = metaslab_claim(spa, bp,
		    spa_get_dsl(spa)->dp_tx.tx_open_txg);
On a level of suspicion: you are claiming blocks in whatever the current open txg is, while as I see it, the normal ZIL uses the dmu_tx_create_assigned(dp, spa_first_txg(spa)) result for the whole claiming process.
	(void) arg;
	lr_write_t *lr = (lr_write_t *)lrc;
	blkptr_t *bp = &lr->lr_blkptr;
	if (lrc->lrc_txtype != TX_WRITE || BP_IS_HOLE(bp))
It seems you are duplicating the normal ZIL claiming logic here, and that normal logic now also handles TX_CLONE_RANGE, which you seem not to.
	int error = metaslab_claim(spa, bp,
	    spa_get_dsl(spa)->dp_tx.tx_open_txg);
TX_WRITE and TX_CLONE_RANGE are expected to have block pointers from the child pool, where the actual data are stored. Meanwhile it seems you are passing the spa of the log pool here. Am I wrong? And respectively, those blocks must be claimed in the context of that pool during its import, not here during shared log pool import.
Motivation and Context
The ZIL is ZFS's system for handling synchronous operations. These operations need to be persisted to some sort of durable media quickly to comply with APIs and provide good performance. The information that makes up the ZIL is usually stored in a SLOG device or an embedded SLOG within a normal device. This solution works very well for systems with single ZFS storage pools, but for systems with several pools, having a SLOG for each one can be tricky to manage. Striking a balance between having too much space and not enough for each of the pools is challenging. This problem is only compounded if the number of pools changes over time. In that case, new disks would need to be added, or disks would need to be repartitioned. These challenges are among the problems that were intended to be solved by ZFS's pool architecture in the first place. While this is certainly not the most common use case for ZFS, there are situations where it is used, usually for performance reasons.
Description
This change introduces the concept of a Shared LOG Pool to ZFS. This pool can be used as the SLOG by other pools, and does not store its own user data (though it does have its own metadata). This allows the SLOG devices to be pooled and shared efficiently.
When a Shared LOG Pool is created, it has a few special properties. First, a shared log pool is identified by the presence of a key in the pool config. Second, shared log pools cannot have filesystems created in them, and their default root filesystem isn't mounted. Third, a new ZAP is created in the MOS; it stores the chain map, which will be explained later.
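As a sketch of the first property (assuming ZFS internal headers; ZPOOL_CONFIG_IS_SHARED_LOG is the config key name that appears elsewhere in this PR's diff, and pool_is_shared_log is a hypothetical helper, not the PR's code):

	/*
	 * Sketch only: a shared log pool is recognized by the mere presence
	 * of a boolean key in its config nvlist, not by any vdev marker.
	 */
	static boolean_t
	pool_is_shared_log(nvlist_t *config)
	{
		return (config != NULL &&
		    nvlist_exists(config, ZPOOL_CONFIG_IS_SHARED_LOG));
	}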
When a pool is created, a flag can be passed along with the name of a Shared LOG Pool to instruct the newly created pool to use the Shared LOG Pool as its log device. We refer to this newly created pool as a "client" pool. When that occurs, a number of things happen.
First, the client pool cannot have any other log devices while it uses the shared log pool, and it can use only one shared log pool at a time. While it is not totally impossible to implement support for this, the complexity was not deemed worth the investment and risk.
Second, the client pool is registered with the shared log pool, and its dependence on the shared log pool is marked in its config. The pool cannot be imported without the shared log pool unless -m is passed to import to discard the logs, and the shared log pool cannot be exported or deleted until the client pool is exported or deleted.
Third, the log metaslab class in the client pool gets marked as virtual and contains a pointer to the shared log pool's spa. Any allocations to the log class go instead to the shared log pool and are allocated from the normal class there. Some extra data structures are also allocated in memory in the client pool's SPA, which will be discussed later.
Fourth, each ZIL that gets created (there's one for each filesystem/zvol) needs to be stored in the shared log pool, rather than in the client pool. However, storing blkptrs to data in another pool is hazardous; vdev IDs will not match, ashifts may differ, and it is generally tricky to handle the separate space accounting requirements of each pool. These are the problems the chain map data structure is intended to solve.
The chain map is a map from a dataset in a specific pool to the head of its ZIL in the shared log pool. This data structure lives in the shared log pool, and is kept both in memory and on disk. On disk, it is stored as a ZAP from two uint64 keys (pool guid and objset id) to a blkptr (the first block in the ZIL chain for that objset). In memory, it is stored as an AVL tree of per-pool structs, each of which contains an AVL tree of per-objset structures storing the block pointer. It is updated in the shared log pool's open context, using the normal dmu_tx APIs to modify the on-disk structure, during the spa_sync of each client pool. The chain map is primarily used in two situations. First, when a client pool is imported, the ZIL of each of its filesystems has a hole blkptr as the head of the ZIL chain. The client pool uses this as a signal to ask the shared log pool for the actual head blkptr, which is only ever stored in memory in the client pool's context. This allows the client pool to do replay when it is mounted. Second, when the shared log pool is imported, the chain map is iterated over. Each ZIL chain in the map then gets iterated over so that the claim process can occur, protecting the ZIL chain blocks from being re-allocated by the shared log pool if their allocation did not get synced to disk due to an unclean shutdown or export.
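A hypothetical sketch of that in-memory layout, assuming ZFS internal headers; the type and field names here are illustrative, not necessarily the PR's:

	typedef struct chain_map_objset {
		uint64_t	cmo_objset_id;	/* second word of the on-disk ZAP key */
		blkptr_t	cmo_chain_head;	/* first block of this objset's ZIL chain */
		avl_node_t	cmo_avl;	/* linkage in the per-pool tree */
	} chain_map_objset_t;

	typedef struct chain_map_pool {
		uint64_t	cmp_pool_guid;	/* first word of the on-disk ZAP key */
		avl_tree_t	cmp_objsets;	/* chain_map_objset_t, sorted by objset id */
		avl_node_t	cmp_avl;	/* linkage in the global tree, sorted by guid */
	} chain_map_pool_t;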
The chain map is updated by the client pools as follows. Each client pool has two relevant data structures. First, there is the spa_zil_map, an AVL tree that stores per-objset data structures, each of which has a list of all the ZIL blocks that got allocated for that objset's ZIL since the last TXG synced. Second, there is the spa_zil_deletes, a list which stores all the ZILs that were deleted since the last TXG synced. At the end of spa_sync, the client pool iterates over each objset in the spa_zil_map and does two things. First, it tells the shared log pool that it can free each ZIL block that is in the chain but is behind the new head location for this TXG. Second, it tells the shared log pool about the new head location for this TXG, so it can update the in-memory and on-disk chain map. The client pool then iterates over the spa_zil_deletes and tells the shared log pool that it can free every block in each chain and delete the entries in the in-memory and on-disk data structures. In this way the chain map is kept up to date with the client pool's view of the world, and all the unneeded ZIL blocks can be freed and reclaimed as quickly as possible.
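A hedged sketch of that end-of-spa_sync flow: spa_zil_map and spa_zil_deletes are named in the description above, while the szo_/szd_ types and the shared_log_*() helpers are illustrative stand-ins for whatever the PR actually calls them.

	static void
	client_sync_chain_map(spa_t *spa)
	{
		spa_zil_objset_t *szo;	/* hypothetical per-objset record */
		spa_zil_delete_t *szd;	/* hypothetical per-deletion record */

		/* Per-objset: free blocks behind the new head, then publish it. */
		for (szo = avl_first(&spa->spa_zil_map); szo != NULL;
		    szo = AVL_NEXT(&spa->spa_zil_map, szo)) {
			shared_log_free_behind(spa, szo->szo_objset_id,
			    &szo->szo_new_head);
			shared_log_set_head(spa, szo->szo_objset_id,
			    &szo->szo_new_head);
		}

		/* Deleted ZILs: free the whole chain and drop the map entries. */
		while ((szd = list_remove_head(&spa->spa_zil_deletes)) != NULL)
			shared_log_delete_chain(spa, szd->szd_objset_id);
	}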
These are the primary features of the shared log architecture, but there are some supporting changes as well. Since it's possible to overwrite a client pool without ever informing the shared log pool, and a race during filesystem deletion can leave a filesystem marked as deleted on the client without being marked as deleted in the provider, we need some way to GC stale entries in the chain map. A new ioctl is added to support that; it will GC the chains of all the pools in the chain map that are not currently imported and active.
We also have some changes to the import process to allow client pools to be imported with a new shared log pool, or with no shared log pool by abandoning their logs. There are also changes to zdb and the command line utilities to support the new features. New tests are added, and the ZFS performance test suite has been enhanced to support pools with SLOGs and shared logs. A new feature flag is added. Finally, there is a fix for a race condition in the ZIL included in these changes.
There was a narrow race window where an LWB can finish its IO, enter zil_lwb_write_done, and set its state to ZIL_LWB_WRITE_DONE. If the next LWB then enters zil_lwb_write_open and calls zil_lwb_set_zio_dependency, it sees that the previous lwb is not null and not flushed yet, so it sets the root IO to be a dependent; but when it gets to the second check, the previous lwb's state is WRITE_DONE, so it doesn't set the write_zio to have a child. The IO then issues and completes, the nlwb's done_func gets called, and it sets nlwb->lwb_state = LWB_STATE_WRITE_DONE. The old lwb's execution of zil_lwb_write_done then resumes and goes into flush_defer, and we see a panic because the nlwb's state is WRITE_DONE. We fix the race by holding the zl_lock slightly longer in zil_lwb_write_done, so that either the nlwb will set its dependency before the lwb can set its own state to WRITE_DONE (and so it will mark itself as a parent IO and its done func will block), or the nlwb will not be able to do zio_rewrite until after the lwb has grabbed the lwb_vdev_lock in zil_lwb_flush_defer. This still leaves a small window where the nlwb finishes its IO, goes into its done func, and starts issuing flushes while the lwb is still trying to manipulate the linked list, which we fix by holding the lwb_vdev_lock while issuing the flushes; that lock will only be contended in the race condition case.
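A standalone pthreads model of the invariant behind this fix (not the ZIL code): the next lwb's dependency check and the previous lwb's transition to WRITE_DONE must be serialized by one lock, the role zl_lock plays above.

	#include <pthread.h>

	typedef enum { LWB_OPEN, LWB_WRITE_DONE } lwb_state_t;

	static pthread_mutex_t zl_lock = PTHREAD_MUTEX_INITIALIZER;
	static lwb_state_t prev_state = LWB_OPEN;
	static int nlwb_is_child;

	/* Next lwb: decide whether to hang itself off the previous lwb's zio. */
	static void
	set_dependency(void)
	{
		pthread_mutex_lock(&zl_lock);
		if (prev_state != LWB_WRITE_DONE)
			nlwb_is_child = 1;	/* previous lwb will wait for us */
		pthread_mutex_unlock(&zl_lock);
	}

	/*
	 * Previous lwb: its write completed. Holding the lock across the
	 * transition means set_dependency() observes either "not done"
	 * (and links the zios) or "done" (fully completed), never a
	 * half-finished state.
	 */
	static void
	write_done(void)
	{
		pthread_mutex_lock(&zl_lock);
		prev_state = LWB_WRITE_DONE;
		pthread_mutex_unlock(&zl_lock);
	}

	int
	main(void)
	{
		/* In the kernel these race from separate threads. */
		set_dependency();
		write_done();
		return (0);
	}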
There are some restrictions associated with shared log pools and client pools. First, reguiding is forbidden. I'm not sure how often this is used, so I'm not really concerned about this restriction, but if the feature is important for someone's use case it could be implemented (the real problem is synchronizing across txgs; it may require use of a ZIL in the shared log pool). Second, checkpoints are forbidden. Checkpointing client pools is not too bad in theory: you simply stop updating the chain map, and then when the checkpoint is deleted, resume freeing. If you roll back instead, you need to free all the blocks after the tail of the ZIL chain at the txg the checkpoint was taken. This is also not too bad, since the lr_t has the txg of the record. This was not implemented since it wasn't deemed necessary for the MVP, but may be done in a future patch. Checkpointing the shared log pool is not especially useful, since the client pool won't understand what's happening; a rollback would cause ZIL entries to vanish without explanation.
How Has This Been Tested?
New unit tests have been added for many of the basic workflow cases described here. I have also manually tested several other cases. Performance testing was also performed; because the new code is almost all gated behind a spa_uses_shared_log check, there is effectively no performance degradation in the normal use case (within the margin of error when testing). Testing was also performed with a shared log pool to see how the performance compares to a normal pool with a SLOG. On a variety of sync-write workloads, shared log performance (measured in bytes of throughput the pool sustained) was within 2% of the SLOG pool, except in tests where large numbers of filesystems are receiving sync writes at once. In the case of 64 filesystems being accessed in parallel, performance was 93% of the SLOG case. The primary difference appears to be time spent waiting for the spa_zil_map_lock. Some work was done to reduce contention on this lock, but more progress is possible if needed.